Filter columns and rows¶

  • Display feature summaries
  • Explore features
  • Filter data

The most recent data in the LendingClub dataset is from 2018, and since then, LendingClub has stopped operating as a peer-to-peer lender. Unsurprisingly, it's difficult to find explanations on the LendingClub website about the features in this dataset.

Sites not officially associated with LendingClub still contain information about the peer-to-peer service previously offered by LendingClub. As a result, the feature exploration for this project includes links to miscellaneous pages such as blogs.

Beginning with the current notebook, however, lack of detailed information about features does impose some limits. For instance, rows containing certain values of loan_status are filtered out simply because it is difficult to understand what those values mean.

This notebook does the following:

  • Explore features to determine what filtering should be done.
  • Filter out certain columns and rows from the data on accepted loans.

The information available on rejected loans is fairly limited, so for now, I will limit attention to the data on accepted loans.

The data-cleaning steps developed in this notebook have been incorporated into the project's Python package notebook_tools.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
from IPython.display import display

from notebook_tools.data_cleaning import (
    convert_acc_loan_data,
    convert_rej_loan_data,
    load_acc_loan_data,
    load_acc_loan_feat_desc,
    load_rej_loan_data,
)
from notebook_tools.feature_exploration import (
    get_group_sizes,
    get_value_counts,
    style_loan_summary,
    style_value_counts,
    summarize_acc_loans,
    summarize_loan_data,
)

Display feature summaries¶

Use functions in the package notebook_tools to load data and generate feature summaries.

In [2]:
acc_loan_data = load_acc_loan_data(excluded_cols=["member_id"]).pipe(
    convert_acc_loan_data
)
In [3]:
acc_loan_feat_desc = load_acc_loan_feat_desc()
In [4]:
rej_loan_data = load_rej_loan_data().pipe(convert_rej_loan_data)

Feature summaries for accepted loans¶

Total number of records: 2,260,701¶

In [5]:
print(f"The number of records for accepted loans is {len(acc_loan_data.index):,d}.")
The number of records for accepted loans is 2,260,701.
In [6]:
for dtype in [np.number, "string", "boolean"]:
    summary = summarize_acc_loans(acc_loan_data, dtype, acc_loan_feat_desc)
    print(f"\n\nThe number of features of type {dtype} is {len(summary.index)}.\n\n")
    display(style_loan_summary(summary))

The number of features of type <class 'numpy.number'> is 112.


  count mean std min 25% 50% 75% max data type description
loan_amnt 2,260,668 15,046.9 9,190.2 500.0 8,000.0 12,900.0 20,000.0 40,000.0 Float64 The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
funded_amnt 2,260,668 15,041.7 9,188.4 500.0 8,000.0 12,875.0 20,000.0 40,000.0 Float64 The total amount committed to that loan at that point in time.
funded_amnt_inv 2,260,668 15,023.4 9,192.3 0.0 8,000.0 12,800.0 20,000.0 40,000.0 Float64 The total amount committed by investors for that loan at that point in time.
term 2,260,668 42.9 10.9 36.0 36.0 36.0 60.0 60.0 Int64 The number of payments on the loan. Values are in months and can be either 36 or 60.
int_rate 2,260,668 13.1 4.8 5.3 9.5 12.6 16.0 31.0 Float64 Interest Rate on the loan
installment 2,260,668 445.8 267.2 4.9 251.7 378.0 593.3 1,719.8 Float64 The monthly payment owed by the borrower if the loan originates.
annual_inc 2,260,664 77,992.4 112,696.2 0.0 46,000.0 65,000.0 93,000.0 110,000,000.0 Float64 The self-reported annual income provided by the borrower during registration.
dti 2,258,957 18.8 14.2 -1.0 11.9 17.8 24.5 999.0 Float64 A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
delinq_2yrs 2,260,639 0.3 0.9 0.0 0.0 0.0 0.0 58.0 Int64 The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years
fico_range_low 2,260,668 698.6 33.0 610.0 675.0 690.0 715.0 845.0 Int64 The lower boundary range the borrower’s FICO at loan origination belongs to.
fico_range_high 2,260,668 702.6 33.0 614.0 679.0 694.0 719.0 850.0 Int64 The upper boundary range the borrower’s FICO at loan origination belongs to.
inq_last_6mths 2,260,638 0.6 0.9 0.0 0.0 0.0 1.0 33.0 Int64 The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
mths_since_last_delinq 1,102,166 34.5 21.9 0.0 16.0 31.0 50.0 226.0 Int64 The number of months since the borrower's last delinquency.
mths_since_last_record 359,156 72.3 26.5 0.0 55.0 74.0 92.0 129.0 Int64 The number of months since the last public record.
open_acc 2,260,639 11.6 5.6 0.0 8.0 11.0 14.0 101.0 Int64 The number of open credit lines in the borrower's credit file.
pub_rec 2,260,639 0.2 0.6 0.0 0.0 0.0 0.0 86.0 Int64 Number of derogatory public records
revol_bal 2,260,668 16,658.5 22,948.3 0.0 5,950.0 11,324.0 20,246.0 2,904,836.0 Float64 Total credit revolving balance
revol_util 2,258,866 50.3 24.7 0.0 31.5 50.3 69.4 892.3 Float64 Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
total_acc 2,260,639 24.2 12.0 1.0 15.0 22.0 31.0 176.0 Int64 The total number of credit lines currently in the borrower's credit file
out_prncp 2,260,668 4,206.9 7,343.2 0.0 0.0 0.0 6,149.9 40,000.0 Float64 Remaining outstanding principal for total amount funded
out_prncp_inv 2,260,668 4,206.0 7,342.3 0.0 0.0 0.0 6,146.3 40,000.0 Float64 Remaining outstanding principal for portion of total amount funded by investors
total_pymnt 2,260,668 12,082.6 9,901.4 0.0 4,546.5 9,329.7 16,940.9 63,296.9 Float64 Payments received to date for total amount funded
total_pymnt_inv 2,260,668 12,064.4 9,897.0 0.0 4,531.8 9,309.7 16,916.7 63,296.9 Float64 Payments received to date for portion of total amount funded by investors
total_rec_prncp 2,260,668 9,505.8 8,321.9 0.0 3,000.0 7,000.0 13,899.1 40,000.0 Float64 Principal received to date
total_rec_int 2,260,668 2,431.4 2,679.7 0.0 728.2 1,525.9 3,108.1 28,192.5 Float64 Interest received to date
total_rec_late_fee 2,260,668 1.5 11.8 -0.0 0.0 0.0 0.0 1,484.3 Float64 Late fees received to date
recoveries 2,260,668 143.9 748.2 0.0 0.0 0.0 0.0 39,859.6 Float64 post charge off gross recovery
collection_recovery_fee 2,260,668 24.0 131.2 0.0 0.0 0.0 0.0 7,174.7 Float64 post charge off collection fee
last_pymnt_amnt 2,260,668 3,429.3 6,018.2 0.0 310.3 600.6 3,743.8 42,192.1 Float64 Last total payment amount received
last_fico_range_high 2,260,668 687.7 73.0 0.0 654.0 699.0 734.0 850.0 Int64 The upper boundary range the borrower’s last FICO pulled belongs to.
last_fico_range_low 2,260,668 675.5 111.1 0.0 650.0 695.0 730.0 845.0 Int64 The lower boundary range the borrower’s last FICO pulled belongs to.
collections_12_mths_ex_med 2,260,523 0.0 0.2 0.0 0.0 0.0 0.0 20.0 Int64 Number of collections in 12 months excluding medical collections
mths_since_last_major_derog 580,775 44.2 21.5 0.0 27.0 44.0 62.0 226.0 Int64 Months since most recent 90-day or worse rating
annual_inc_joint 120,710 123,624.6 74,161.3 5,693.5 83,400.0 110,000.0 147,995.0 7,874,821.0 Float64 The combined self-reported annual income provided by the co-borrowers during registration
dti_joint 120,706 19.3 7.8 0.0 13.5 18.8 24.6 69.5 Float64 A ratio calculated using the co-borrowers' total monthly payments on the total debt obligations, excluding mortgages and the requested LC loan, divided by the co-borrowers' combined self-reported monthly income
acc_now_delinq 2,260,639 0.0 0.1 0.0 0.0 0.0 0.0 14.0 Int64 The number of accounts on which the borrower is now delinquent.
tot_coll_amt 2,190,392 232.7 8,518.5 0.0 0.0 0.0 0.0 9,152,545.0 Float64 Total collection amounts ever owed
tot_cur_bal 2,190,392 142,492.2 160,692.6 0.0 29,092.0 79,240.0 213,204.0 9,971,659.0 Float64 Total current balance of all accounts
open_acc_6m 1,394,538 0.9 1.1 0.0 0.0 1.0 1.0 18.0 Int64 Number of open trades in last 6 months
open_act_il 1,394,539 2.8 3.0 0.0 1.0 2.0 3.0 57.0 Int64 Number of currently active installment trades
open_il_12m 1,394,539 0.7 0.9 0.0 0.0 0.0 1.0 25.0 Int64 Number of installment accounts opened in past 12 months
open_il_24m 1,394,539 1.6 1.6 0.0 0.0 1.0 2.0 51.0 Int64 Number of installment accounts opened in past 24 months
mths_since_rcnt_il 1,350,744 21.2 26.0 0.0 7.0 13.0 24.0 511.0 Int64 Months since most recent installment accounts opened
total_bal_il 1,394,539 35,506.6 44,097.5 0.0 8,695.0 23,127.0 46,095.0 1,837,038.0 Float64 Total current balance of all installment accounts
il_util 1,191,818 69.1 23.7 0.0 55.0 72.0 86.0 1,000.0 Float64 Ratio of total current balance to high credit/credit limit on all install acct
open_rv_12m 1,394,539 1.3 1.5 0.0 0.0 1.0 2.0 28.0 Int64 Number of revolving trades opened in past 12 months
open_rv_24m 1,394,539 2.7 2.6 0.0 1.0 2.0 4.0 60.0 Int64 Number of revolving trades opened in past 24 months
max_bal_bc 1,394,539 5,806.4 5,690.6 0.0 2,284.0 4,413.0 7,598.0 1,170,668.0 Float64 Maximum current balance owed on all revolving accounts
all_util 1,394,320 57.0 20.9 0.0 43.0 58.0 72.0 239.0 Float64 Balance to credit limit on all trades
total_rev_hi_lim 2,190,392 34,573.9 36,728.5 0.0 14,700.0 25,400.0 43,200.0 9,999,999.0 Float64 Total revolving high credit/credit limit
inq_fi 1,394,539 1.0 1.5 0.0 0.0 1.0 1.0 48.0 Int64 Number of personal finance inquiries
total_cu_tl 1,394,538 1.5 2.7 0.0 0.0 0.0 2.0 111.0 Int64 Number of finance trades
inq_last_12m 1,394,538 2.0 2.4 0.0 0.0 1.0 3.0 67.0 Int64 Number of credit inquiries in past 12 months
acc_open_past_24mths 2,210,638 4.5 3.2 0.0 2.0 4.0 6.0 64.0 Int64 Number of trades opened in past 24 months.
avg_cur_bal 2,190,322 13,547.8 16,474.1 0.0 3,080.0 7,335.0 18,783.0 958,084.0 Float64 Average current balance of all accounts
bc_open_to_buy 2,185,733 11,394.3 16,599.5 0.0 1,722.0 5,442.0 14,187.0 711,140.0 Float64 Total open to buy on revolving bankcards.
bc_util 2,184,597 57.9 28.6 0.0 35.4 60.2 83.1 339.6 Float64 Ratio of total current balance to high credit/credit limit for all bankcard accounts.
chargeoff_within_12_mths 2,260,523 0.0 0.1 0.0 0.0 0.0 0.0 10.0 Int64 Number of charge-offs within 12 months
delinq_amnt 2,260,639 12.4 726.5 0.0 0.0 0.0 0.0 249,925.0 Float64 The past-due amount owed for the accounts on which the borrower is now delinquent.
mo_sin_old_il_acct 2,121,597 125.7 53.4 0.0 96.0 130.0 154.0 999.0 Int64 Months since oldest bank installment account opened
mo_sin_old_rev_tl_op 2,190,391 181.5 97.1 1.0 116.0 164.0 232.0 999.0 Int64 Months since oldest revolving account opened
mo_sin_rcnt_rev_tl_op 2,190,391 14.0 17.5 0.0 4.0 8.0 17.0 547.0 Int64 Months since most recent revolving account opened
mo_sin_rcnt_tl 2,190,392 8.3 9.2 0.0 3.0 6.0 11.0 382.0 Int64 Months since most recent account opened
mort_acc 2,210,638 1.6 1.9 0.0 0.0 1.0 3.0 94.0 Int64 Number of mortgage accounts.
mths_since_recent_bc 2,187,256 24.8 32.3 0.0 6.0 14.0 30.0 661.0 Int64 Months since most recent bankcard account opened.
mths_since_recent_bc_dlq 519,701 39.3 22.6 0.0 21.0 37.0 57.0 202.0 Int64 Months since most recent bankcard delinquency
mths_since_recent_inq 1,965,233 7.0 6.0 0.0 2.0 5.0 11.0 25.0 Int64 Months since most recent inquiry.
mths_since_recent_revol_delinq 740,359 35.8 22.3 0.0 17.0 33.0 51.0 202.0 Int64 Months since most recent revolving delinquency.
num_accts_ever_120_pd 2,190,392 0.5 1.4 0.0 0.0 0.0 0.0 58.0 Int64 Number of accounts ever 120 or more days past due
num_actv_bc_tl 2,190,392 3.7 2.3 0.0 2.0 3.0 5.0 50.0 Int64 Number of currently active bankcard accounts
num_actv_rev_tl 2,190,392 5.6 3.4 0.0 3.0 5.0 7.0 72.0 Int64 Number of currently active revolving trades
num_bc_sats 2,202,078 4.8 3.0 0.0 3.0 4.0 6.0 71.0 Int64 Number of satisfactory bankcard accounts
num_bc_tl 2,190,392 7.7 4.7 0.0 4.0 7.0 10.0 86.0 Int64 Number of bankcard accounts
num_il_tl 2,190,392 8.4 7.4 0.0 3.0 6.0 11.0 159.0 Int64 Number of installment accounts
num_op_rev_tl 2,190,392 8.2 4.7 0.0 5.0 7.0 10.0 91.0 Int64 Number of open revolving accounts
num_rev_accts 2,190,391 14.0 8.0 0.0 8.0 12.0 18.0 151.0 Int64 Number of revolving accounts
num_rev_tl_bal_gt_0 2,190,392 5.6 3.3 0.0 3.0 5.0 7.0 65.0 Int64 Number of revolving trades with balance >0
num_sats 2,202,078 11.6 5.6 0.0 8.0 11.0 14.0 101.0 Int64 Number of satisfactory accounts
num_tl_120dpd_2m 2,107,011 0.0 0.0 0.0 0.0 0.0 0.0 7.0 Int64 Number of accounts currently 120 days past due (updated in past 2 months)
num_tl_30dpd 2,190,392 0.0 0.1 0.0 0.0 0.0 0.0 4.0 Int64 Number of accounts currently 30 days past due (updated in past 2 months)
num_tl_90g_dpd_24m 2,190,392 0.1 0.5 0.0 0.0 0.0 0.0 58.0 Int64 Number of accounts 90 or more days past due in last 24 months
num_tl_op_past_12m 2,190,392 2.1 1.8 0.0 1.0 2.0 3.0 32.0 Int64 Number of accounts opened in past 12 months
pct_tl_nvr_dlq 2,190,237 94.1 9.0 0.0 91.3 100.0 100.0 100.0 Float64 Percent of trades never delinquent
percent_bc_gt_75 2,185,289 42.4 36.2 0.0 0.0 37.5 71.4 100.0 Float64 Percentage of all bankcard accounts > 75% of limit.
pub_rec_bankruptcies 2,259,303 0.1 0.4 0.0 0.0 0.0 0.0 12.0 Int64 Number of public record bankruptcies
tax_liens 2,260,563 0.0 0.4 0.0 0.0 0.0 0.0 85.0 Int64 Number of tax liens
tot_hi_cred_lim 2,190,392 178,242.8 181,574.8 0.0 50,731.0 114,298.5 257,755.0 9,999,999.0 Float64 Total high credit/credit limit
total_bal_ex_mort 2,210,638 51,022.9 49,911.2 0.0 20,892.0 37,864.0 64,350.0 3,408,095.0 Float64 Total credit balance excluding mortgage
total_bc_limit 2,210,638 23,193.8 23,006.6 0.0 8,300.0 16,300.0 30,300.0 1,569,000.0 Float64 Total bankcard high credit/credit limit
total_il_high_credit_limit 2,190,392 43,732.0 45,073.0 0.0 15,000.0 32,696.0 58,804.2 2,118,996.0 Float64 Total installment high credit/credit limit
revol_bal_joint 108,020 33,617.3 28,153.9 0.0 15,106.8 26,516.5 43,769.0 1,110,019.0 Float64 Sum of revolving credit balance of the co-borrowers, net of duplicate balances
sec_app_fico_range_low 108,021 669.8 44.7 540.0 645.0 670.0 695.0 845.0 Int64 FICO range (high) for the secondary applicant
sec_app_fico_range_high 108,021 673.8 44.7 544.0 649.0 674.0 699.0 850.0 Int64 FICO range (low) for the secondary applicant
sec_app_inq_last_6mths 108,021 0.6 1.0 0.0 0.0 0.0 1.0 6.0 Int64 Credit inquiries in the last 6 months at time of application for the secondary applicant
sec_app_mort_acc 108,021 1.5 1.8 0.0 0.0 1.0 2.0 27.0 Int64 Number of mortgage accounts at time of application for the secondary applicant
sec_app_open_acc 108,021 11.5 6.6 0.0 7.0 10.0 15.0 82.0 Int64 Number of open trades at time of application for the secondary applicant
sec_app_revol_util 106,184 58.2 25.5 0.0 39.8 60.2 78.6 434.3 Float64 Ratio of total current balance to high credit/credit limit for all revolving accounts
sec_app_open_act_il 108,021 3.0 3.3 0.0 1.0 2.0 4.0 43.0 Int64 Number of currently active installment trades at time of application for the secondary applicant
sec_app_num_rev_accts 108,021 12.5 8.2 0.0 7.0 11.0 17.0 106.0 Int64 Number of revolving accounts at time of application for the secondary applicant
sec_app_chargeoff_within_12_mths 108,021 0.0 0.4 0.0 0.0 0.0 0.0 21.0 Int64 Number of charge-offs within last 12 months at time of application for the secondary applicant
sec_app_collections_12_mths_ex_med 108,021 0.1 0.4 0.0 0.0 0.0 0.0 23.0 Int64 Number of collections within last 12 months excluding medical collections at time of application for the secondary applicant
sec_app_mths_since_last_major_derog 35,942 36.9 23.9 0.0 16.0 36.0 56.0 185.0 Int64 Months since most recent 90-day or worse rating at time of application for the secondary applicant
deferral_term 10,917 3.0 0.0 3.0 3.0 3.0 3.0 3.0 Int64 Amount of months that the borrower is expected to pay less than the contractual monthly payment amount due to a hardship plan
hardship_amount 10,917 155.0 129.0 0.6 59.4 119.1 213.3 943.9 Float64 The interest payment that the borrower has committed to make each month while they are on a hardship plan
hardship_length 10,917 3.0 0.0 3.0 3.0 3.0 3.0 3.0 Int64 The number of months the borrower will make smaller payments than normally obligated due to a hardship plan
hardship_dpd 10,917 13.7 9.7 0.0 5.0 15.0 22.0 37.0 Int64 Account days past due as of the hardship plan start date
orig_projected_additional_accrued_interest 8,651 454.8 375.4 1.9 175.2 352.8 620.2 2,680.9 Float64 The original projected additional interest amount that will accrue for the given hardship payment plan as of the Hardship Start Date. This field will be null if the borrower has broken their hardship payment plan.
hardship_payoff_balance_amount 10,917 11,636.9 7,626.0 55.7 5,627.0 10,028.4 16,151.9 40,306.4 Float64 The payoff balance amount as of the hardship plan start date
hardship_last_payment_amount 10,917 194.0 198.6 0.0 44.4 133.2 284.2 1,407.9 Float64 The last payment amount as of the hardship plan start date
settlement_amount 34,246 5,010.7 3,693.1 44.2 2,208.0 4,146.1 6,850.2 33,601.0 Float64 The loan amount that the borrower has agreed to settle for
settlement_percentage 34,246 47.8 7.3 0.2 45.0 45.0 50.0 521.4 Float64 The settlement amount as a percentage of the payoff balance amount on the loan
settlement_term 34,246 13.2 8.2 0.0 6.0 14.0 18.0 181.0 Int64 The number of months that the borrower will be on the settlement plan

The number of features of type string is 35.


  count unique top freq data type description
id 2,260,701 2,260,701 68407277 1 string A unique LC assigned ID for the loan listing.
grade 2,260,668 7 B 663,557 string LC assigned loan grade
sub_grade 2,260,668 35 C1 145,903 string LC assigned loan subgrade
emp_title 2,093,699 512,694 Teacher 38,824 string The job title supplied by the Borrower when applying for the loan.*
emp_length 2,113,761 11 10+ years 748,005 string Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
home_ownership 2,260,668 6 MORTGAGE 1,111,450 string The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER
verification_status 2,260,668 3 Source Verified 886,231 string Indicates if income was verified by LC, not verified, or if the income source was verified
issue_d 2,260,668 139 2016-03 61,992 string The month which the loan was funded
loan_status 2,260,668 9 Fully Paid 1,076,751 string Current status of the loan
url 2,260,668 2,260,668 https://lendingclub.com/browse/loanDetail.action?loan_id=68407277 1 string URL for the LC page with listing data.
desc 126,065 124,500 252 string Loan description provided by the borrower
purpose 2,260,668 14 debt_consolidation 1,277,877 string A category provided by the borrower for the loan request.
title 2,237,342 63,154 Debt consolidation 1,153,293 string The loan title provided by the borrower
zip_code 2,260,667 956 112xx 23,908 string The first 3 numbers of the zip code provided by the borrower in the loan application.
addr_state 2,260,668 51 CA 314,533 string The state provided by the borrower in the loan application
earliest_cr_line 2,260,639 754 2004-09 15,400 string The month the borrower's earliest reported credit line was opened
initial_list_status 2,260,668 2 w 1,535,467 string The initial listing status of the loan. Possible values are – W, F
last_pymnt_d 2,258,241 136 2019-03 853,003 string Last month payment was received
next_pymnt_d 915,358 106 2019-04 912,221 string Next scheduled payment date
last_credit_pull_d 2,260,596 141 2019-03 1,371,381 string The most recent month LC pulled credit for this loan
policy_code 2,260,668 1 1.0 2,260,668 string publicly available policy_code=1 new products not publicly available policy_code=2
application_type 2,260,668 2 Individual 2,139,958 string Indicates whether the loan is an individual application or a joint application with two co-borrowers
verification_status_joint 115,730 3 Not Verified 57,403 string Indicates if the co-borrowers' joint income was verified by LC, not verified, or if the income source was verified
sec_app_earliest_cr_line 108,021 663 2006-08 998 string Earliest credit line at time of application for the secondary applicant
hardship_type 10,917 1 INTEREST ONLY-3 MONTHS DEFERRAL 10,917 string Describes the hardship plan offering
hardship_reason 10,917 9 NATURAL_DISASTER 2,965 string Describes the reason the hardship plan was offered
hardship_status 10,917 3 COMPLETED 7,819 string Describes if the hardship plan is active, pending, canceled, completed, or broken
hardship_start_date 10,917 27 2017-09 2,444 string The start date of the hardship plan period
hardship_end_date 10,917 28 2017-12 1,756 string The end date of the hardship plan period
payment_plan_start_date 10,917 27 2017-09 1,715 string The day the first hardship plan payment is due. For example, if a borrower has a hardship plan period of 3 months, the start date is the start of the three-month period in which the borrower is allowed to make interest-only payments.
hardship_loan_status 10,917 5 Late (16-30 days) 4,770 string Loan Status as of the hardship plan start date
disbursement_method 2,260,668 2 Cash 2,182,546 string The method by which the borrower receives their loan. Possible values are: CASH, DIRECT_PAY
debt_settlement_flag_date 34,246 83 2019-02 2,606 string The most recent date that the Debt_Settlement_Flag has been set  
settlement_status 34,246 3 ACTIVE 14,704 string The status of the borrower’s settlement plan. Possible values are: COMPLETE, ACTIVE, BROKEN, CANCELLED, DENIED, DRAFT
settlement_date 34,246 90 2019-01 1,710 string The date that the borrower agrees to the settlement plan

The number of features of type boolean is 3.


  count unique top freq data type description
pymnt_plan 2,260,668 2 False 2,260,048 boolean Indicates if a payment plan has been put in place for the loan
hardship_flag 2,260,668 2 False 2,259,836 boolean Flags whether or not the borrower is on a hardship plan
debt_settlement_flag 2,260,668 2 False 2,226,422 boolean Flags whether or not the borrower, who has charged-off, is working with a debt-settlement company.

Feature summaries for rejected loans¶

Total number of records: 27,648,741¶

In [7]:
print(f"The number of records for accepted loans is {len(rej_loan_data.index):,d}.")
The number of records for accepted loans is 27,648,741.
In [8]:
for dtype in [np.number, "string"]:
    summary = summarize_loan_data(rej_loan_data, dtype)
    display(style_loan_summary(summary))
  count mean std min 25% 50% 75% max data type
Amount Requested 27,648,741 13,133.2 15,009.6 0.0 4,800.0 10,000.0 20,000.0 1,400,000.0 Float64
Risk_Score 9,151,111 628.2 89.9 0.0 591.0 637.0 675.0 990.0 Float64
Debt-To-Income Ratio 27,648,741 143.3 10,539.2 -1.0 8.1 20.0 36.6 50,000,031.5 Float64
  count unique top freq data type
Application Date 27,648,741 4,238 2018-12-04 42,112 string
Loan Title 27,647,436 73,927 Debt consolidation 6,418,016 string
Zip Code 27,648,448 1,001 112xx 267,102 string
State 27,648,719 51 CA 3,242,169 string
Employment Length 26,697,386 11 < 1 year 22,994,315 string
Policy Code 27,647,823 2 0.0 27,559,694 string

Explore features¶

policy_code / Policy Code¶

What do the columns policy_code (for accepted loans) and Policy Code (for rejected loans) refer to?

From "What are these Policy Code 2 Loans at Lending Club?":

  • These [Policy Code 2 loans] are loans made to borrowers that do not meet Lending Club’s current credit policy standards.
  • The FICO scores on these borrowers are typically 640-659, below the 660 threshold on Policy Code 1 loans.
  • These loans are made available to select institutional investors who have a great deal of experience with consumer loans in this credit spectrum and with Lending Club.
In [9]:
policy_code_counts = get_value_counts(acc_loan_data["policy_code"])
display(style_value_counts(policy_code_counts))
  count
policy_code  
1.0 2,260,668
<NA> 33
In [10]:
policy_code_counts_rej = get_value_counts(rej_loan_data["Policy Code"])
display(style_value_counts(policy_code_counts_rej))
  count
Policy Code  
0.0 27,559,694
2.0 88,129
<NA> 918

loan_status¶

What are the distinct values for the column loan_status?

In [11]:
loan_status_counts = get_value_counts(acc_loan_data["loan_status"])
display(style_value_counts(loan_status_counts))
  count
loan_status  
Fully Paid 1,076,751
Current 878,317
Charged Off 268,559
Late (31-120 days) 21,467
In Grace Period 8,436
Late (16-30 days) 4,349
Does not meet the credit policy. Status:Fully Paid 1,988
Does not meet the credit policy. Status:Charged Off 761
Default 40
<NA> 33

The 33 rows that have NA for loan_status also have NA for all other features other than id, so these rows can be filtered from the data.

From the values of id displayed in the output of next cell, these null rows appear to be associated with the policy code.

In [12]:
missing_status = acc_loan_data[acc_loan_data["loan_status"].isna()]
In [13]:
display(missing_status.head(4).transpose())
421095 421096 528961 528962
id Total amount funded in policy code 1: 6417608175 Total amount funded in policy code 2: 1944088810 Total amount funded in policy code 1: 1741781700 Total amount funded in policy code 2: 564202131
loan_amnt <NA> <NA> <NA> <NA>
funded_amnt <NA> <NA> <NA> <NA>
funded_amnt_inv <NA> <NA> <NA> <NA>
term <NA> <NA> <NA> <NA>
... ... ... ... ...
settlement_status NaN NaN NaN NaN
settlement_date NaN NaN NaN NaN
settlement_amount NaN NaN NaN NaN
settlement_percentage NaN NaN NaN NaN
settlement_term NaN NaN NaN NaN

150 rows × 4 columns

As a check, verify that if the id column is dropped, then all values are NA in rows that are missing loan_status.

In [14]:
display(missing_status.drop("id", axis="columns").count().sum())
0

Create a dataframe that has these empty rows filtered out. After additional filtering, this dataframe will be used to recreate the SQLite database.

In [15]:
filtered_loan_data = acc_loan_data[acc_loan_data["loan_status"].notna()]
In [16]:
loan_status_counts = get_value_counts(filtered_loan_data["loan_status"])
display(style_value_counts(loan_status_counts))
  count
loan_status  
Fully Paid 1,076,751
Current 878,317
Charged Off 268,559
Late (31-120 days) 21,467
In Grace Period 8,436
Late (16-30 days) 4,349
Does not meet the credit policy. Status:Fully Paid 1,988
Does not meet the credit policy. Status:Charged Off 761
Default 40

Note that after the rows with missing loan_status have been filtered out, there are no missing values for policy_code. Since all rows have the same value for policy_code, this column can be dropped.

In [17]:
policy_code_counts = get_value_counts(filtered_loan_data["policy_code"])
display(style_value_counts(policy_code_counts))
  count
policy_code  
1.0 2,260,668

Most of the values for loan_status are explained at "What Do the Different Note Statuses Mean?".

However, the values Does not meet the credit policy. Status:Fully Paid and Does not meet the credit policy. Status:Charged Off are unclear. Let's take look at a random sample of the rows that have these value of loan status.

In [18]:
bool_index = filtered_loan_data["loan_status"].str.endswith("Status:Fully Paid")
sampled_data = filtered_loan_data[bool_index].sample(
    n=5, random_state=59147, axis="index"
)
with pd.option_context("display.max_columns", None):
    display(sampled_data)
id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade emp_title emp_length home_ownership annual_inc verification_status issue_d loan_status pymnt_plan url desc purpose title zip_code addr_state dti delinq_2yrs earliest_cr_line fico_range_low fico_range_high inq_last_6mths mths_since_last_delinq mths_since_last_record open_acc pub_rec revol_bal revol_util total_acc initial_list_status out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_d last_pymnt_amnt next_pymnt_d last_credit_pull_d last_fico_range_high last_fico_range_low collections_12_mths_ex_med mths_since_last_major_derog policy_code application_type annual_inc_joint dti_joint verification_status_joint acc_now_delinq tot_coll_amt tot_cur_bal open_acc_6m open_act_il open_il_12m open_il_24m mths_since_rcnt_il total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m acc_open_past_24mths avg_cur_bal bc_open_to_buy bc_util chargeoff_within_12_mths delinq_amnt mo_sin_old_il_acct mo_sin_old_rev_tl_op mo_sin_rcnt_rev_tl_op mo_sin_rcnt_tl mort_acc mths_since_recent_bc mths_since_recent_bc_dlq mths_since_recent_inq mths_since_recent_revol_delinq num_accts_ever_120_pd num_actv_bc_tl num_actv_rev_tl num_bc_sats num_bc_tl num_il_tl num_op_rev_tl num_rev_accts num_rev_tl_bal_gt_0 num_sats num_tl_120dpd_2m num_tl_30dpd num_tl_90g_dpd_24m num_tl_op_past_12m pct_tl_nvr_dlq percent_bc_gt_75 pub_rec_bankruptcies tax_liens tot_hi_cred_lim total_bal_ex_mort total_bc_limit total_il_high_credit_limit revol_bal_joint sec_app_fico_range_low sec_app_fico_range_high sec_app_earliest_cr_line sec_app_inq_last_6mths sec_app_mort_acc sec_app_open_acc sec_app_revol_util sec_app_open_act_il sec_app_num_rev_accts sec_app_chargeoff_within_12_mths sec_app_collections_12_mths_ex_med sec_app_mths_since_last_major_derog hardship_flag hardship_type hardship_reason hardship_status deferral_term hardship_amount hardship_start_date hardship_end_date payment_plan_start_date hardship_length hardship_dpd hardship_loan_status orig_projected_additional_accrued_interest hardship_payoff_balance_amount hardship_last_payment_amount disbursement_method debt_settlement_flag debt_settlement_flag_date settlement_status settlement_date settlement_amount settlement_percentage settlement_term
1653100 389258 5000.0 5000.0 4688.548225 36 16.0 175.79 E E2 Mustang Harrys Restaurant 2 years RENT 21000.0 Not Verified 2009-04 Does not meet the credit policy. Status:Fully ... False https://lendingclub.com/browse/loanDetail.acti... i am trying to buy a car and pay for the insur... major_purchase personal 104xx NY 14.0 0 2006-11 680 684 2 <NA> <NA> 14 0 6348.0 52.0 18 f 0.0 0.0 6328.258462 5943.66 4999.99 1328.26 0.0 0.0 0.0 2012-04 197.99 2012-05 2012-04 714 710 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA>
1652370 507550 6400.0 6400.0 6400.0 36 12.73 214.83 C C1 Advanced Cash Register Systems Inc 1 year MORTGAGE 43000.0 Not Verified 2010-04 Does not meet the credit policy. Status:Fully ... False https://lendingclub.com/browse/loanDetail.acti... Borrower added on 04/22/10 > Our main object... debt_consolidation Debt Consolidation loan 531xx WI 8.4 0 1994-09 695 699 4 <NA> 80 7 1 4686.0 27.7 30 f 0.0 0.0 7734.435939 7734.44 6400.0 1334.44 0.0 0.0 0.0 2013-05 244.7 2013-06 2019-03 654 650 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA>
1652294 522251 6000.0 6000.0 6000.0 60 19.04 155.78 F F3 Blueline Software Services 1 year MORTGAGE 60000.0 Not Verified 2010-05 Does not meet the credit policy. Status:Fully ... False https://lendingclub.com/browse/loanDetail.acti... Borrower added on 05/26/10 > I plan on using... other Financial Freedom 070xx NJ 9.54 1 1999-05 675 679 4 14 <NA> 18 0 7376.0 53.8 28 f 0.0 0.0 7846.999584 7847.0 6000.0 1847.0 0.0 0.0 0.0 2012-04 4581.37 2012-05 2012-04 624 620 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA>
1651865 601025 2800.0 2800.0 2800.0 36 14.46 96.33 D D2 Plasma Biological Services 3 years OWN 41000.0 Source Verified 2010-10 Does not meet the credit policy. Status:Fully ... False https://lendingclub.com/browse/loanDetail.acti... Borrower added on 10/19/10 > Lenders:<br/>I ... debt_consolidation Good Bye Credit Card Debt! 646xx MO 20.81 1 2003-08 665 669 6 11 <NA> 20 0 3442.0 35.5 24 f 0.0 0.0 3464.22895 3464.23 2800.0 664.23 0.0 0.0 0.0 2013-09 285.98 2013-10 2019-03 664 660 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA>
1652535 489091 14725.0 14725.0 14725.0 36 14.22 504.84 C C5 TradeStation Securities 1 year MORTGAGE 60000.0 Not Verified 2010-03 Does not meet the credit policy. Status:Fully ... False https://lendingclub.com/browse/loanDetail.acti... Borrower added on 02/25/10 > 1. I am employe... debt_consolidation You make money while I save on interest 330xx FL 9.09 0 2005-10 700 704 4 <NA> <NA> 11 0 226.0 3.4 14 f 0.0 0.0 18174.993854 18174.99 14724.99 3450.0 0.0 0.0 0.0 2013-03 551.17 2013-04 2013-03 639 635 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA>
In [19]:
bool_index = filtered_loan_data["loan_status"].str.endswith("Status:Charged Off")
sampled_data = filtered_loan_data[bool_index].sample(
    n=5, random_state=59147, axis="index"
)
with pd.option_context("display.max_columns", None):
    display(sampled_data)
id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade emp_title emp_length home_ownership annual_inc verification_status issue_d loan_status pymnt_plan url desc purpose title zip_code addr_state dti delinq_2yrs earliest_cr_line fico_range_low fico_range_high inq_last_6mths mths_since_last_delinq mths_since_last_record open_acc pub_rec revol_bal revol_util total_acc initial_list_status out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_d last_pymnt_amnt next_pymnt_d last_credit_pull_d last_fico_range_high last_fico_range_low collections_12_mths_ex_med mths_since_last_major_derog policy_code application_type annual_inc_joint dti_joint verification_status_joint acc_now_delinq tot_coll_amt tot_cur_bal open_acc_6m open_act_il open_il_12m open_il_24m mths_since_rcnt_il total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m acc_open_past_24mths avg_cur_bal bc_open_to_buy bc_util chargeoff_within_12_mths delinq_amnt mo_sin_old_il_acct mo_sin_old_rev_tl_op mo_sin_rcnt_rev_tl_op mo_sin_rcnt_tl mort_acc mths_since_recent_bc mths_since_recent_bc_dlq mths_since_recent_inq mths_since_recent_revol_delinq num_accts_ever_120_pd num_actv_bc_tl num_actv_rev_tl num_bc_sats num_bc_tl num_il_tl num_op_rev_tl num_rev_accts num_rev_tl_bal_gt_0 num_sats num_tl_120dpd_2m num_tl_30dpd num_tl_90g_dpd_24m num_tl_op_past_12m pct_tl_nvr_dlq percent_bc_gt_75 pub_rec_bankruptcies tax_liens tot_hi_cred_lim total_bal_ex_mort total_bc_limit total_il_high_credit_limit revol_bal_joint sec_app_fico_range_low sec_app_fico_range_high sec_app_earliest_cr_line sec_app_inq_last_6mths sec_app_mort_acc sec_app_open_acc sec_app_revol_util sec_app_open_act_il sec_app_num_rev_accts sec_app_chargeoff_within_12_mths sec_app_collections_12_mths_ex_med sec_app_mths_since_last_major_derog hardship_flag hardship_type hardship_reason hardship_status deferral_term hardship_amount hardship_start_date hardship_end_date payment_plan_start_date hardship_length hardship_dpd hardship_loan_status orig_projected_additional_accrued_interest hardship_payoff_balance_amount hardship_last_payment_amount disbursement_method debt_settlement_flag debt_settlement_flag_date settlement_status settlement_date settlement_amount settlement_percentage settlement_term
1654218 145926 10475.0 10475.0 5925.003199 36 14.38 359.95 E E4 Sampson Community College 5 years OWN 76500.0 Not Verified 2007-11 Does not meet the credit policy. Status:Charge... False https://lendingclub.com/browse/loanDetail.acti... I would like to consolidate credit card and lo... credit_card Harold 283xx NC 9.98 0 1994-12 640 644 6 71 0 9 0 15918.0 101.4 16 f 0.0 0.0 6763.02 3823.8 4662.27 1806.22 29.962676 264.57 2.73 2009-06 359.95 2009-12 2017-02 504 500 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA>
1652310 519096 10000.0 10000.0 9975.0 60 17.93 253.56 E E5 WORLD MICRO 4 years OWN 70000.0 Not Verified 2010-05 Does not meet the credit policy. Status:Charge... False https://lendingclub.com/browse/loanDetail.acti... Borrower added on 05/17/10 > I am looking to... debt_consolidation VETERAN CREDIT CARD RELIEF 301xx GA 10.8 0 1998-11 665 669 5 57 64 7 1 11838.0 56.1 15 f 0.0 0.0 13902.02 13867.36 8538.74 5123.97 0.0 239.31 37.7604 2014-12 253.56 2015-05 2016-10 589 585 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA>
1651959 584984 10000.0 10000.0 10000.0 36 13.61 339.89 C C2 <NA> 4 years MORTGAGE 57600.0 Verified 2010-09 Does not meet the credit policy. Status:Charge... False https://lendingclub.com/browse/loanDetail.acti... Borrower added on 09/22/10 > I will use the ... debt_consolidation personal loan 324xx FL 9.33 0 2000-11 695 699 4 <NA> <NA> 19 0 7162.0 31.6 28 f 0.0 0.0 3059.85 3059.85 2121.23 920.59 0.0 18.03 5.5 2011-07 339.89 2011-11 2019-03 634 630 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA>
1652322 517699 6000.0 6000.0 5975.0 60 17.93 152.14 E E5 Sepaton 4 years RENT 103000.0 Not Verified 2010-05 Does not meet the credit policy. Status:Charge... False https://lendingclub.com/browse/loanDetail.acti... Borrower added on 05/14/10 > Expanding our i... small_business Business expansion 018xx MA 7.07 1 1988-09 665 669 6 10 <NA> 5 0 54.0 13.5 10 f 0.0 0.0 4920.97 4900.53 2433.82 2268.26 0.0 218.89 2.18 2013-01 152.14 2013-05 2016-10 519 515 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA>
1652419 501230 15000.0 15000.0 14628.44 36 14.59 516.98 D D1 Van-Dee Manufacturing Co. 10+ years MORTGAGE 65000.0 Verified 2010-04 Does not meet the credit policy. Status:Charge... False https://lendingclub.com/browse/loanDetail.acti... Borrower added on 04/05/10 > Small business ... small_business High Profile new business loan 605xx IL 19.44 0 1989-07 710 714 7 <NA> 115 14 1 51322.0 44.2 26 f 0.0 0.0 3101.88 3029.48 2067.04 1030.94 0.0 3.9 0.0 2010-10 516.98 2011-03 2019-03 664 660 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0 0.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 1 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA>

Nothing jumps out from this small random sample. Rather than trying to guess why certain rows do not meet the credit policy, I'll exclude these rows.

In [20]:
bool_index = filtered_loan_data["loan_status"].str.startswith("Does not meet")
filtered_loan_data = filtered_loan_data[~bool_index]
In [21]:
loan_status_counts = get_value_counts(filtered_loan_data["loan_status"])
display(style_value_counts(loan_status_counts))
  count
loan_status  
Fully Paid 1,076,751
Current 878,317
Charged Off 268,559
Late (31-120 days) 21,467
In Grace Period 8,436
Late (16-30 days) 4,349
Default 40

issue_d¶

The description of this feature is "The month which the loan was funded".

After rows with problematic values of loan_status have been filtered out, there are no missing values for issue_d.

In [22]:
filtered_loan_data["issue_d"].isna().sum()
Out[22]:
0
In [23]:
to_plot = get_group_sizes(filtered_loan_data, group_by="issue_d")
fig = px.line(
    to_plot,
    x="issue_d",
    y="count",
    markers=True,
    labels={"issue_d": "Loan date", "count": "Number of loans"},
    hover_data={"count": ":.3s"},
    title="Number of accepted loans by date",
)
fig.show()

I will exclude pre-2012 dates from the analysis.

Analysis and prediction based on this data will need to take account of changes in behavior over time, and given the relatively small number of loans issued before 2012, it is not worthwhile to include the pre-2012 data.

In [24]:
bool_index = filtered_loan_data["issue_d"] >= "2012-01"
filtered_loan_data = filtered_loan_data[bool_index]
In [25]:
to_plot = get_group_sizes(filtered_loan_data, group_by="issue_d")
fig = px.line(
    to_plot,
    x="issue_d",
    y="count",
    markers=True,
    labels={"issue_d": "Loan date", "count": "Number of loans"},
    hover_data={"count": ":.3s"},
    title="Number of accepted loans by date",
)
fig.show()

loan_amnt / funded_amnt / funded_amnt_inv¶

What is the distinction between loan_amnt, funded_amnt, funded_amnt_inv?

Start by examining the feature descriptions.

In [26]:
amount_features = acc_loan_feat_desc.loc[
    ["loan_amnt", "funded_amnt", "funded_amnt_inv"], ["description"]
]
display(style_loan_summary(amount_features))
  description
column name  
loan_amnt The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
funded_amnt The total amount committed to that loan at that point in time.
funded_amnt_inv The total amount committed by investors for that loan at that point in time.

What should we infer in cases where loan_amnt is different than funded_amnt, or in cases where funded_amnt is different than funded_amnt_inv? It's not completely clear from these descriptions.

Take a closer look at these cases.

In [27]:
# First check for missing values.
for column_name in ["loan_amnt", "funded_amnt", "funded_amnt_inv"]:
    na_count = filtered_loan_data[column_name].isna().sum()
    print(f'\nThe number of missing values for feature "{column_name}" is {na_count}.')
The number of missing values for feature "loan_amnt" is 0.

The number of missing values for feature "funded_amnt" is 0.

The number of missing values for feature "funded_amnt_inv" is 0.
In [28]:
bool_index = (filtered_loan_data["loan_amnt"] - filtered_loan_data["funded_amnt"]) != 0
print(
    '\nThe number of loans with "loan_amnt" different than "funded_amnt" is '
    f"{sum(bool_index)}.\n"
)
partially_funded_loans = filtered_loan_data[bool_index]
The number of loans with "loan_amnt" different than "funded_amnt" is 68.

In [29]:
funding_gap = (
    partially_funded_loans["loan_amnt"] - partially_funded_loans["funded_amnt"]
)
funding_gap.name = "funding_gap"
funding_gap.to_frame().describe()
Out[29]:
funding_gap
count 68.0
mean 7384.191176
std 4779.019572
min 25.0
25% 3843.75
50% 7062.5
75% 10281.25
max 21150.0

The differences between funded_amnt and loan_amnt range from \$25 to over \$21k.

This is consistent with the description of loan_amnt as the amount requested by the borrower.

In [30]:
to_plot = get_group_sizes(partially_funded_loans, group_by="issue_d")
fig = px.scatter(
    to_plot,
    x="issue_d",
    y="count",
    labels={"issue_d": "Loan date", "count": "Number of loans"},
    hover_data={"count": ":,d"},
    title='Date of loans with "loan_amnt" different than "funded_amnt"',
)
fig.show()
In [31]:
bool_index = (
    filtered_loan_data["funded_amnt"] - filtered_loan_data["funded_amnt_inv"]
) != 0
print(
    '\nThe number of loans with "funded_amnt" different than "funded_amnt_inv" is '
    f"{sum(bool_index)}.\n"
)
partially_funded_by_investors = filtered_loan_data[bool_index]
The number of loans with "funded_amnt" different than "funded_amnt_inv" is 129094.

In [32]:
investor_funding_gap = (
    partially_funded_by_investors["funded_amnt"]
    - partially_funded_by_investors["funded_amnt_inv"]
)
investor_funding_gap.name = "investor_funding_gap"
investor_funding_gap.to_frame().describe()
Out[32]:
investor_funding_gap
count 129094.0
mean 94.58058
std 173.332427
min 0.346054
25% 25.0
50% 50.0
75% 100.0
max 28925.0

The differences between funded_amnt_inv and funded_amnt range from less than a dollar to around \$29k.

The fact that the difference is always positive is consistent with the descriptions of these two features.

It's a little surprising to see the value \$0.346054, since the value is specified to a small fraction of a cent. Examine the values of funded_amnt and funded_amnt_inv in cases where the difference between them is less than a dollar.

In [33]:
to_check = partially_funded_by_investors[["funded_amnt", "funded_amnt_inv"]].assign(
    investor_funding_gap=investor_funding_gap
)
to_check[to_check["investor_funding_gap"] < 1]
Out[33]:
funded_amnt funded_amnt_inv investor_funding_gap
523236 14000.0 13999.653946 0.346054
1014559 35000.0 34999.573964 0.426036
1040719 28000.0 27999.0923 0.9077
1481021 19200.0 19199.516656 0.483344
1910970 7850.0 7849.445596 0.554404
1913361 24375.0 24374.467907 0.532093
1913665 9000.0 8999.386905 0.613095
1913748 14125.0 14124.448796 0.551204
1913864 24925.0 24924.456741 0.543259
1914004 21000.0 20999.448797 0.551203

It appears that in cases, funded_amnt_inv was specified with excessive precision.

In [34]:
to_plot = get_group_sizes(partially_funded_by_investors, group_by="issue_d")
fig = px.scatter(
    to_plot,
    x="issue_d",
    y="count",
    labels={"issue_d": "Loan date", "count": "Number of loans"},
    hover_data={"count": ":,d"},
    title='Date of loans with "funded_amnt" different than "funded_amnt_inv"',
)
fig.show()

Discussion:

  • Only 68 of the 2.2 million loans have loan_amnt different than funded_amnt. Essentially all the loans are fully funded.
  • About 130k of the loans have different values for funded_amnt and funded_amnt_inv. Is LendingClub itself providing funding in these case?
  • The funding gaps expressed as (loan_amnt - funded_amnt) and (funded_amnt - funded_amnt_inv) are always positive, as expected from the descriptions of the three features.

While I don't understand the cause of the differences between loan_amnt, funded_amnt, and funded_amnt_inv, I won't filter out the rows with different values for these features. Unlike the rows where loan_status includes the string "Does not meet the credit policy", there isn't a strong indication that rows with different values for loan_amnt, funded_amnt, and funded_amnt_inv are fundamentally problematic.

initial_list_status¶

The feature initial_list_status is explained in this blog post:

The variable initial_list_status is available in the public data and identifies whether a loan was initially listed in the whole (W) or fractional (F) market. Loans listed “whole” become available for fractional funding (and vice versa) if there are no buyers within a certain time frame.

In [35]:
list_status_counts = get_value_counts(filtered_loan_data["initial_list_status"])
display(style_value_counts(list_status_counts))
  count
initial_list_status  
w 1,535,467
f 682,666

Given this explanation of the feature initial_list_status, there's no need to drop the feature or filter out rows based on the value of the feature

int_rate¶

The notebook analysis-01.ipynb presents an in-depth analysis of interest rates for the accepted loans. While doing that analysis, I found that some loans had an anomalously low interest rate, given the poor grade assigned to the loans by LendingClub.

Here I present analysis showing that some interest rates are anomalously low, and I explore possible explanations. Since none of the explanations are well supported by the data, I filter out the corresponding loans.

Note on the code for visualization: I use plotly histograms to highlight the anomalous interest rates, and the binning for these histograms needs to be done outside of plotly. The reason is that plotly does binning in JavaScript, and so unbinned data passed to plotly's histogram function becomes part of the javascript code stored with the notebook. For the current data set, this can increase the notebook size on disk by a factor of more than 100.

In [36]:
min = filtered_loan_data["int_rate"].min()
max = filtered_loan_data["int_rate"].max()
print(
    'The minimum and maximum values of "int_rate" '
    f"are {min} and {max}, respectively."
)
The minimum and maximum values of "int_rate" are 5.31 and 30.99, respectively.
In [37]:
# Define arrays / lists needed for binning the histogram and plotting the bins in
# plotly.
int_rate_bins = np.linspace(5, 31, num=27)
int_rate_bin_labels = [f"{left:d}% - {left+0.99:.2f}%" for left in range(5, 31)]
int_rate_tick_vals = int_rate_bin_labels[0::5]
int_rate_tick_text = [f"{left}%" for left in range(5, 35, 5)]
In [38]:
filtered_loan_data["int_rate_bin"] = pd.cut(
    filtered_loan_data["int_rate"],
    bins=int_rate_bins,
    labels=int_rate_bin_labels,
    right=False,
)
In [39]:
to_plot = get_group_sizes(filtered_loan_data, group_by="int_rate_bin")
fig = px.bar(
    to_plot,
    x="int_rate_bin",
    y="count",
    labels={"count": "Number of loans", "int_rate_bin": "Interest rate"},
    title="Distribution of loan interest rate",
)
customdata = to_plot["int_rate_bin"]
hovertemplate = "Interest rate=%{customdata}<br>Number of loans=%{y:.3s}<extra></extra>"
fig.update_traces(customdata=customdata, hovertemplate=hovertemplate)
fig.update_layout(bargap=0)
fig.update_xaxes(
    tickmode="array", tickvals=int_rate_tick_vals, ticktext=int_rate_tick_text
)
fig.show()
In [40]:
to_plot = get_group_sizes(filtered_loan_data, group_by=["int_rate_bin", "grade"])
fig = px.bar(
    to_plot,
    x="int_rate_bin",
    y="count",
    facet_row="grade",
    labels={
        "count": "Number of loans",
        "int_rate_bin": "Interest rate",
        "grade": "Grade",
    },
    title="Distribution of loan interest rate by loan grade",
    hover_data={"count": ":.3s"},
    height=1200,
)
fig.update_xaxes(tickmode="array", tickvals=int_rate_tick_vals, tickangle=45)
fig.update_yaxes(matches=None, title="")
fig.update_layout(bargap=0, yaxis4_title="Number of Loans")
fig.show()
In [41]:
to_plot = get_group_sizes(
    filtered_loan_data, group_by=["int_rate_bin", "grade", "sub_grade"]
)
to_plot["sub_grade"] = to_plot["sub_grade"].str[1]
fig = px.bar(
    to_plot,
    x="int_rate_bin",
    y="count",
    facet_row="grade",
    color="sub_grade",
    labels={
        "count": "Number of loans",
        "int_rate_bin": "Interest rate",
        "grade": "Grade",
        "sub_grade": "Sub-grade",
    },
    title="Distribution of loan interest rate by loan grade and sub-grade",
    hover_data={"count": ":.3s"},
    height=1200,
)
fig.update_xaxes(tickmode="array", tickvals=int_rate_tick_vals, tickangle=45)
fig.update_yaxes(matches=None, title="")
fig.update_layout(bargap=0, yaxis4_title="Number of Loans")
fig.show()

The previous plots show the following:

  • The interest rate varies systematically with the loan grade.
  • For loans with a high grade, the interest rate varies systemtically with loan sub-grade.
  • For loans with a low grade, the dependence of interest rate on sub-grade is complex.

To get a different view of these patterns, I'll set the y-axis to show percentage of loans rather than number of loans in each category.

In [42]:
to_plot = get_group_sizes(
    filtered_loan_data, group_by=["int_rate_bin", "grade", "sub_grade"]
)
to_plot["sub_grade"] = to_plot["sub_grade"].str[1]
fig = px.histogram(
    to_plot,
    x="int_rate_bin",
    y="count",
    facet_row="grade",
    color="sub_grade",
    labels={
        "count": "Number of loans",
        "int_rate_bin": "Interest rate",
        "grade": "Grade",
        "sub_grade": "Sub-grade",
    },
    barnorm="fraction",
    title="Distribution of loan interest rate by loan grade and sub-grade",
    height=1200,
)


def clean_up_hovertemplate(trace):
    trace.hovertemplate = trace.hovertemplate.replace(
        "sum of Number of loans (normalized as fraction)", "Percentage"
    )


fig.for_each_trace(clean_up_hovertemplate)
fig.update_xaxes(tickmode="array", tickvals=int_rate_tick_vals, tickangle=45)
fig.update_yaxes(title="", tickformat=".0%")
fig.update_layout(bargap=0, yaxis4_title="Number of loans")
fig.show()

The plot above shows the anomalous interest rates. For example, most of the loans in the bottom row, which correspond to the lowest grade, have an interest rate of at least 22%, but the plot also shows a block of loans in the bottom row with interest rate in the range of 6% - 6.99%.

Possible explanations that I explored:

  • These rates are associated with a hardship plan or settlement plan.
  • The funding mechanism for these loans is unusual.
  • These loans are associated with some external event and were all issued at around the same time.
  • An improvement in the borrower's FICO score may have caused the interest rate to be lowered without changing the assigned loan grade.

As shown below, I could not find support in the data for any of these guesses.

First select all of anomalous cases with interest rate in the range of 6% - 6.99%. There are 625 of these loans.

In [43]:
anomalous_int_rate_data = filtered_loan_data[
    (filtered_loan_data["grade"] != "A") & (filtered_loan_data["int_rate"] < 7)
]
In [44]:
print(f"Number of loans: {len(anomalous_int_rate_data.index)}.")
Number of loans: 625.

Check the category of loans with grade 'D' and interest rate 8% - 8.99% shown in the plot above. There is only 1 loan in this category.

In [45]:
bool_index = (filtered_loan_data["grade"] == "D") & (
    filtered_loan_data["int_rate"].between(7.99, 9, inclusive="neither")
)
print(f"Number of loans: {sum(bool_index)}.")
Number of loans: 1.

I will focus on the anomalous cases with interest rate in the range of 6% - 6.99%.

All of these loans have the same rate: 6.0%.

In [46]:
anomalous_rates = list(anomalous_int_rate_data["int_rate"].unique())
print(f"Distinct values of interest rate:  {anomalous_rates}")
Distinct values of interest rate:  [6.0]

Only a few are associated with a hardship plan or settlement plan.

In [47]:
display(
    style_value_counts(get_value_counts(anomalous_int_rate_data["hardship_status"]))
)
display(
    style_value_counts(get_value_counts(anomalous_int_rate_data["settlement_term"]))
)
  count
hardship_status  
<NA> 621
COMPLETED 3
BROKEN 1
  count
settlement_term  
<NA> 622
1 2
18 1

All of these loans are fully funded. For 46 of the loans, the value of funded_amnt is different from the value of funded_amnt_inv, but most of the loans are fully funded by investors.

In [48]:
bool_index = (
    anomalous_int_rate_data["loan_amnt"] - anomalous_int_rate_data["funded_amnt"]
) != 0

print(f"The number of these loans that are not fully funded is {sum(bool_index)}.")
The number of these loans that are not fully funded is 0.
In [49]:
bool_index = (
    anomalous_int_rate_data["funded_amnt"] - anomalous_int_rate_data["funded_amnt_inv"]
) != 0
print(
    'The number of these loans with "funded_amnt" different than "funded_amnt_inv" '
    f"is {sum(bool_index)}."
)
The number of these loans with "funded_amnt" different than "funded_amnt_inv" is 46.

The dates of these loans are spread over the full ranges of dates of the filtered data, so they do not appear to be associated with an external event.

In [50]:
to_plot = get_group_sizes(anomalous_int_rate_data, group_by="issue_d")
fig = px.scatter(
    to_plot,
    x="issue_d",
    y="count",
    labels={"issue_d": "Loan date", "count": "Number of loans"},
    title="Dates of loans with anomalous interest rate 6.0%",
)
fig.show()

Look at a random sample of these records and check whether the FICO score improved from the value it had at loan origination.

In [51]:
# Define a reproducible random state for sampling randomly
rng = np.random.default_rng(seed=16513)
In [52]:
sampled_data = anomalous_int_rate_data.sample(10, random_state=rng)
In [53]:
fico_columns = [
    "fico_range_low",
    "fico_range_high",
    "last_fico_range_low",
    "last_fico_range_high",
]
style_loan_summary(sampled_data[fico_columns])
Out[53]:
  fico_range_low fico_range_high last_fico_range_low last_fico_range_high
2210886 675 679 725 729
50250 720 724 690 694
18516 690 694 685 689
1756456 660 664 625 629
1301075 695 699 700 704
89739 715 719 680 684
1012395 695 699 640 644
1290147 705 709 710 714
594599 695 699 655 659
2046566 745 749 605 609

The columns fico_range_low and fico_range_high correspond to values at loan origination, while the columns last_fico_range_low and last_fico_range_high correspond to a more recent check of the credit rating. This random sample does not show a systematic jump in FICO scores for the loans with anomalously low interest rates.

Inspect the columns of the data frame for a patterns that might explain the low rates.

In [54]:
with pd.option_context("display.max_columns", None):
    display(sampled_data)
id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade emp_title emp_length home_ownership annual_inc verification_status issue_d loan_status pymnt_plan url desc purpose title zip_code addr_state dti delinq_2yrs earliest_cr_line fico_range_low fico_range_high inq_last_6mths mths_since_last_delinq mths_since_last_record open_acc pub_rec revol_bal revol_util total_acc initial_list_status out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_d last_pymnt_amnt next_pymnt_d last_credit_pull_d last_fico_range_high last_fico_range_low collections_12_mths_ex_med mths_since_last_major_derog policy_code application_type annual_inc_joint dti_joint verification_status_joint acc_now_delinq tot_coll_amt tot_cur_bal open_acc_6m open_act_il open_il_12m open_il_24m mths_since_rcnt_il total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m acc_open_past_24mths avg_cur_bal bc_open_to_buy bc_util chargeoff_within_12_mths delinq_amnt mo_sin_old_il_acct mo_sin_old_rev_tl_op mo_sin_rcnt_rev_tl_op mo_sin_rcnt_tl mort_acc mths_since_recent_bc mths_since_recent_bc_dlq mths_since_recent_inq mths_since_recent_revol_delinq num_accts_ever_120_pd num_actv_bc_tl num_actv_rev_tl num_bc_sats num_bc_tl num_il_tl num_op_rev_tl num_rev_accts num_rev_tl_bal_gt_0 num_sats num_tl_120dpd_2m num_tl_30dpd num_tl_90g_dpd_24m num_tl_op_past_12m pct_tl_nvr_dlq percent_bc_gt_75 pub_rec_bankruptcies tax_liens tot_hi_cred_lim total_bal_ex_mort total_bc_limit total_il_high_credit_limit revol_bal_joint sec_app_fico_range_low sec_app_fico_range_high sec_app_earliest_cr_line sec_app_inq_last_6mths sec_app_mort_acc sec_app_open_acc sec_app_revol_util sec_app_open_act_il sec_app_num_rev_accts sec_app_chargeoff_within_12_mths sec_app_collections_12_mths_ex_med sec_app_mths_since_last_major_derog hardship_flag hardship_type hardship_reason hardship_status deferral_term hardship_amount hardship_start_date hardship_end_date payment_plan_start_date hardship_length hardship_dpd hardship_loan_status orig_projected_additional_accrued_interest hardship_payoff_balance_amount hardship_last_payment_amount disbursement_method debt_settlement_flag debt_settlement_flag_date settlement_status settlement_date settlement_amount settlement_percentage settlement_term int_rate_bin
2210886 92449578 25725.0 25725.0 25725.0 60 6.0 495.59 C C5 Logistics 10+ years MORTGAGE 72000.0 Verified 2016-11 Current False https://lendingclub.com/browse/loanDetail.acti... <NA> debt_consolidation Debt consolidation 581xx ND 25.64 1 1994-09 675 679 0 14 <NA> 29 0 22715.0 42.4 49 w 14621.67 14621.67 14802.01 14802.01 11103.33 3698.68 0.0 0.0 0.0 2019-03 495.59 2019-04 2019-03 729 725 0 <NA> 1.0 Joint App 102000.0 24.37 Not Verified 0 0.0 304691.0 1 6 0 1 18 21660.0 63.0 1 7 6533.0 50.0 53600.0 0 0 2 10 11719.0 13142.0 59.4 0 0.0 265 119 3 3 3 21 <NA> 1 <NA> 0 10 11 11 16 13 22 33 11 28 0 0 0 2 95.9 41.7 0 0 352077.0 44375.0 32400.0 34546.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA> 6% - 6.99%
50250 65673954 2500.0 2500.0 2500.0 36 6.0 77.64 C C1 courtesy associate 10+ years OWN 55000.0 Source Verified 2015-11 Fully Paid False https://lendingclub.com/browse/loanDetail.acti... <NA> debt_consolidation Debt consolidation 270xx NC 16.71 0 2000-06 720 724 0 58 <NA> 11 0 5032.0 29.6 27 w 0.0 0.0 2983.286179 2983.29 2500.0 483.29 0.0 0.0 0.0 2018-12 0.03 <NA> 2018-12 694 690 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 0.0 42504.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 17000.0 <NA> <NA> <NA> 6 4723.0 3729.0 42.6 0 0.0 174 185 4 3 0 4 <NA> 17 <NA> 0 1 3 3 4 15 9 12 3 11 0 0 0 3 96.3 33.3 0 0 55388.0 42504.0 6500.0 38388.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA> 6% - 6.99%
18516 66024347 13675.0 13675.0 13675.0 60 6.0 272.39 E E4 <NA> <NA> RENT 36000.0 Verified 2015-12 Fully Paid False https://lendingclub.com/browse/loanDetail.acti... <NA> credit_card Credit card refinancing 748xx OK 32.3 0 2007-04 690 694 0 43 <NA> 8 0 3508.0 25.1 22 w 0.0 0.0 16505.636346 16505.64 13675.0 2830.64 0.0 0.0 0.0 2017-11 8208.82 <NA> 2018-10 689 685 0 43 1.0 Individual <NA> <NA> <NA> 0 0.0 25090.0 1 3 1 3 3 21582.0 81.0 2 3 1277.0 62.0 14000.0 1 0 2 6 3136.0 2492.0 42.0 0 0.0 104 97 7 3 0 7 <NA> 7 <NA> 1 2 3 3 3 17 5 5 3 8 0 0 0 3 86.4 33.3 0 0 40567.0 25090.0 4300.0 26567.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA> 6% - 6.99%
1756456 9836985 14675.0 14675.0 14675.0 36 6.0 455.52 C C3 Maintenance Tech 1 year RENT 55000.0 Not Verified 2013-12 Fully Paid False https://lendingclub.com/browse/loanDetail.acti... <NA> credit_card Consolidation 444xx OH 12.55 2 2008-08 660 664 0 9 <NA> 9 0 6852.0 57.1 13 f 0.0 0.0 16942.459999 16942.46 14675.0 2242.03 25.43 0.0 0.0 2015-09 7223.39 <NA> 2017-07 629 625 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 0.0 25886.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 12000.0 <NA> <NA> <NA> 2 2876.0 4056.0 61.7 0 0.0 51 64 3 3 0 3 9 15 9 0 4 6 4 5 6 6 7 6 9 0 0 0 1 83.3 25.0 0 0 33065.0 25886.0 10600.0 21065.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA> 6% - 6.99%
1301075 13408227 15600.0 15600.0 15600.0 60 6.0 317.38 C C3 POLICE OFFICER 8 years MORTGAGE 71000.0 Source Verified 2014-04 Fully Paid False https://lendingclub.com/browse/loanDetail.acti... <NA> debt_consolidation Debt consolidation 337xx FL 26.08 0 1999-11 695 699 0 <NA> <NA> 18 0 16189.0 40.1 47 w 0.0 0.0 20398.094502 20398.09 15600.0 4798.09 0.0 0.0 0.0 2017-07 6606.72 <NA> 2017-12 704 700 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 0.0 241875.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 40400.0 <NA> <NA> <NA> 10 13438.0 5508.0 53.7 0 0.0 172 113 10 8 4 15 <NA> 12 <NA> 0 6 8 8 14 20 13 23 8 18 0 0 0 4 100.0 37.5 0 0 280224.0 67895.0 11900.0 59924.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA> 6% - 6.99%
89739 63326688 25000.0 25000.0 24950.0 60 6.0 507.9 D D2 Attorney 3 years MORTGAGE 73600.0 Source Verified 2015-10 Current False https://lendingclub.com/browse/loanDetail.acti... <NA> debt_consolidation Debt consolidation 324xx FL 22.68 0 2000-10 715 719 1 <NA> <NA> 22 0 12795.0 95.5 41 w 9644.51 9625.22 23116.93 23070.7 15355.49 7761.44 0.0 0.0 0.0 2019-03 507.9 2019-04 2019-03 684 680 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 0.0 293111.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 13400.0 <NA> <NA> <NA> 8 13323.0 1.0 99.9 0 0.0 180 136 5 3 2 96 <NA> 5 <NA> 0 1 3 1 1 34 3 5 3 22 0 0 0 3 100.0 100.0 0 0 351325.0 137837.0 1000.0 166313.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA> 6% - 6.99%
1012395 75299976 30000.0 30000.0 30000.0 60 6.0 686.58 F F3 Management Assistant 10+ years MORTGAGE 79098.0 Source Verified 2016-03 Current False https://lendingclub.com/browse/loanDetail.acti... <NA> debt_consolidation Debt consolidation 221xx VA 17.92 0 2002-01 695 699 0 64 <NA> 11 0 3385.0 30.0 23 w 16118.06 16118.06 29411.19 29411.19 13881.94 15294.51 234.74 0.0 0.0 2019-02 686.58 2019-04 2019-03 644 640 0 64 1.0 Individual <NA> <NA> <NA> 0 0.0 367052.0 2 5 2 4 5 84304.0 124.0 1 2 2005.0 110.0 11300.0 1 1 2 7 33368.0 4915.0 40.8 0 0.0 104 170 11 5 2 11 81 5 81 2 3 3 3 5 11 4 9 3 11 0 0 0 4 90.9 0.0 0 0 363394.0 87689.0 8300.0 68047.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False INTEREST ONLY-3 MONTHS DEFERRAL FAMILY_DEATH BROKEN 3 532.15 2017-08 2017-09 2017-08 3 26 Late (16-30 days) <NA> 26593.67 26.13 Cash False <NA> <NA> <NA> <NA> <NA> <NA> 6% - 6.99%
1290147 14530462 26000.0 26000.0 26000.0 60 6.0 511.92 D D1 Budget Analyst 10+ years MORTGAGE 75000.0 Verified 2014-04 Fully Paid False https://lendingclub.com/browse/loanDetail.acti... <NA> debt_consolidation Debt consolidation 226xx VA 24.94 1 2000-10 705 709 0 13 <NA> 16 0 19651.0 52.6 44 w 0.0 0.0 32326.93 32326.93 26000.0 6326.93 0.0 0.0 0.0 2016-03 17799.6 <NA> 2019-03 714 710 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 0.0 70119.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 33800.0 <NA> <NA> <NA> 7 5843.0 7349.0 68.8 0 0.0 162 162 10 3 4 10 13 12 13 0 3 3 5 18 8 12 32 3 16 0 0 0 3 95.2 50.0 0 0 99474.0 70119.0 27000.0 65674.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA> 6% - 6.99%
594599 116207885 20000.0 20000.0 20000.0 60 6.0 321.16 E E4 Leasing Credit Officer 10+ years RENT 70000.0 Source Verified 2017-08 Current False https://lendingclub.com/browse/loanDetail.acti... <NA> medical Medical expenses 606xx IL 11.33 0 2006-07 695 699 1 26 <NA> 15 0 10065.0 33.4 31 w 11878.81 11878.81 14049.25 14049.25 8121.19 5928.06 0.0 0.0 0.0 2019-03 321.16 2019-04 2019-03 659 655 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 0.0 308544.0 2 5 0 2 17 42244.0 96.0 0 0 3084.0 71.0 30100.0 4 2 3 4 20570.0 13601.0 18.6 0 0.0 133 132 35 2 3 61 <NA> 0 <NA> 0 2 6 3 4 19 8 9 6 15 0 0 0 2 93.1 0.0 0 0 328931.0 52309.0 16700.0 43920.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA> 6% - 6.99%
2046566 126149005 10000.0 10000.0 10000.0 36 6.0 292.38 D D1 Lead 9 years RENT 60000.0 Source Verified 2017-12 Current False https://lendingclub.com/browse/loanDetail.acti... <NA> debt_consolidation Debt consolidation 104xx NY 26.86 0 2010-11 745 749 2 <NA> <NA> 9 0 3378.0 15.7 11 w 5814.79 5814.79 6026.72 6026.72 4185.21 1787.96 53.55 0.0 0.0 2019-03 999.19 2019-04 2019-03 609 605 0 <NA> 1.0 Individual <NA> <NA> <NA> 0 0.0 66278.0 3 2 1 1 3 62900.0 98.0 2 3 1416.0 53.0 21500.0 0 2 5 4 9468.0 5093.0 24.0 0 0.0 53 85 1 1 0 1 <NA> 1 <NA> 0 4 5 6 7 3 7 8 5 9 0 0 0 3 100.0 25.0 0 0 104891.0 66278.0 6700.0 83391.0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> False <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Cash False <NA> <NA> <NA> <NA> <NA> <NA> 6% - 6.99%

I don't see any interesting patterns in this sample, so I will filter out the 626 rows that have anomalously low interest rates.

In [55]:
int_rate_is_anomalous = (filtered_loan_data["grade"] != "A") & (
    filtered_loan_data["int_rate"] < 7
)
int_rate_is_anomalous = int_rate_is_anomalous | (
    (filtered_loan_data["grade"] == "D") & (filtered_loan_data["int_rate"] < 9)
)
filtered_loan_data = filtered_loan_data[~int_rate_is_anomalous]

hardship_type, hardship_length, and deferral_term¶

The features hardship_type, hardship_length, and deferral_term each have only one value (other than <NA>).

In [56]:
hardship_type_counts = get_value_counts(filtered_loan_data["hardship_type"])
display(style_value_counts(hardship_type_counts))
  count
hardship_type  
<NA> 2,206,594
INTEREST ONLY-3 MONTHS DEFERRAL 10,913
In [57]:
hardship_length_counts = get_value_counts(filtered_loan_data["hardship_length"])
display(style_value_counts(hardship_length_counts))
  count
hardship_length  
<NA> 2,206,594
3 10,913
In [58]:
deferral_term_counts = get_value_counts(filtered_loan_data["deferral_term"])
display(style_value_counts(deferral_term_counts))
  count
deferral_term  
<NA> 2,206,594
3 10,913

It appears that loans with <NA> in these columns are not associated with a hardship plan, so each of these columns can be used as a flag for hardship plans.

If these columns are dropped, can we infer from other columns which loans are associated with a hardship plan?

Yes, because a loan has hardship_amount equal to <NA> if and only if hardship_type, hardship_length, and deferral_term are <NA>.

In [59]:
hardship_amount_isna = filtered_loan_data["hardship_amount"].isna()
matching_na = (
    hardship_amount_isna.equals(filtered_loan_data["hardship_type"].isna())
    and hardship_amount_isna.equals(filtered_loan_data["hardship_length"].isna())
    and hardship_amount_isna.equals(filtered_loan_data["deferral_term"].isna())
)
if matching_na:
    print(
        "The columns hardship_amount, hardship_type, hardship_length, "
        "and deferral term have <NA> values at the same positions."
    )
else:
    print("Mismatch in <NA> values.")
The columns hardship_amount, hardship_type, hardship_length, and deferral term have <NA> values at the same positions.

Filter data¶

Taking account of the feature summaries and the feature exploration above, certain columns will be excluded from the analysis of accepted loans.

  • url: URL for the LC page with listing data
  • title: The loan title provided by the borrower
  • desc: Loan description provided by the borrower
  • policy_code: publicly available policy_code=1, new products not publicly available policy_code=2
  • hardship_type: Describes the hardship plan offering
  • hardship_length: The number of months the borrower will make smaller payments than normally obligated due to a hardship plan
  • deferral_term: Amount of months that the borrower is expected to pay less than the contractual monthly payment amount due to a hardship plan

Also, rows will be featured out based on the following criteria:

  • Problematic values for loan_status
    1. <NA>
    2. Does not meet the credit policy. Status:Fully Paid
    3. Does not meet the credit policy. Status:Charged Off
  • Values of issue_d before 2012
  • Anomalously low values of int_rate
In [60]:
to_drop = [
    "url",
    "title",
    "desc",
    "policy_code",
    "hardship_type",
    "hardship_length",
    "deferral_term",
]
filtered_loan_data = filtered_loan_data.drop(
    to_drop,
    axis="columns",
)